What’s in a Domain? Analyzing Genre and Topic Differences in SMT
Authors
Abstract
Domain adaptation is an active field of research in statistical machine translation (SMT), but so far most work has ignored the distinction between the topic and genre of documents. In this paper we quantify and disentangle the impact of genre and topic differences on translation quality by introducing a new data set that has controlled topic and genre distributions. In addition, we perform a detailed analysis showing that differences across topics explain translation performance differences across genres only to a limited degree, and that genre-specific errors are more attributable to model coverage than to suboptimal scoring of translation candidates.
Similar resources
Selection-Based Language Model for Domain Adaptation using Topic Modeling
This paper introduces a selection-based language model (LM) that uses topic modeling for domain adaptation, which is often required in statistical machine translation. The selection-based LM slightly outperforms the state-of-the-art Moore-Lewis LM, by 1.0% for EN-ES and 0.7% for ES-EN in terms of BLEU. The performance gain in terms of perplexity was 8% over the Moore-Lewis LM and 17%...
Interpersonal Metadiscourse in Newspaper Editorials
The power of media lies in its persuasive function, which gives media a potential to maneuver on the mind of audience (van Dijk 1996). This potential is realized via different linguistic resources, one important group of which is metadiscoursal resources. The major aim of this study was to explore how and in what distribution these resources are employed by writers with different cultural backg...
Large SMT data-sets extracted from Wikipedia
The article presents experiments on mining Wikipedia to extract sentence pairs useful for SMT in three language pairs. Each extracted sentence pair is associated with a cross-lingual lexical similarity score, based on which several evaluations have been conducted to estimate the similarity thresholds which allow the extraction of the most useful data for training three-language pairs SMT system...
Translation Model Adaptation Using Genre-Revealing Text Features
Research in domain adaptation for statistical machine translation (SMT) has resulted in various approaches that adapt system components to specific translation tasks. The concept of a domain, however, is not precisely defined, and most approaches rely on provenance information or manual subcorpus labels, while genre differences have not been addressed explicitly. Motivated by the large translat...